
Seeing Hate Differently: Hate Subspace Modeling for Culture-Aware Hate Speech Detection

Cai, Weibin, Zafarani, Reza

arXiv.org Artificial Intelligence

Hate speech detection has been extensively studied, yet existing methods often overlook a real-world complexity: training labels are biased, and interpretations of what is considered hate vary across individuals with different cultural backgrounds. We first analyze these challenges, including data sparsity, cultural entanglement, and ambiguous labeling. To address them, we propose a culture-aware framework that constructs individuals' hate subspaces. To alleviate data sparsity, we model combinations of cultural attributes. For cultural entanglement and ambiguous labels, we use label propagation to capture distinctive features of each combination. Finally, we construct individual hate subspaces, which in turn can further enhance classification performance. Experiments show our method outperforms the state of the art by 1.05% on average across all metrics.
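The label-propagation step the abstract mentions can be sketched as a simple iterative update over cultural-attribute combinations; the similarity matrix, seed labels, and update rule below are illustrative assumptions, not the paper's actual formulation.

```python
# Toy label propagation over cultural-attribute combinations
# (all names and numbers here are hypothetical).
import numpy as np

def propagate_labels(sim, labels, alpha=0.5, iters=10):
    """Iteratively blend each combination's label scores with its neighbors'.

    sim    : (n, n) row-normalized similarity between attribute combinations
    labels : (n, k) initial (possibly sparse or ambiguous) label scores
    alpha  : weight kept on the original seed labels each step
    """
    y = labels.astype(float).copy()
    for _ in range(iters):
        y = alpha * labels + (1 - alpha) * sim @ y
    return y

# 3 attribute combinations, 2 classes (hate / not-hate).
sim = np.array([[0.0, 1.0, 0.0],
                [0.5, 0.0, 0.5],
                [0.0, 1.0, 0.0]])
seed = np.array([[1.0, 0.0],   # combination 0: labeled hate
                 [0.0, 0.0],   # combination 1: unlabeled
                 [0.0, 1.0]])  # combination 2: labeled not-hate
scores = propagate_labels(sim, seed)
```

After propagation, the unlabeled combination receives soft scores from both labeled neighbors, which is the kind of signal a per-individual hate subspace could then be built from.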


WATCHED: A Web AI Agent Tool for Combating Hate Speech by Expanding Data

Piot, Paloma, Sánchez, Diego, Parapar, Javier

arXiv.org Artificial Intelligence

Online harms are a growing problem in digital spaces, putting user safety at risk and reducing trust in social media platforms. One of the most persistent forms of harm is hate speech. To address this, we need tools that combine the speed and scale of automated systems with the judgment and insight of human moderators. These tools should not only find harmful content but also explain their decisions clearly, helping to build trust and understanding. In this paper, we present WATCHED, a chatbot designed to support content moderators in tackling hate speech. The chatbot is built as an Artificial Intelligence Agent system that uses Large Language Models along with several specialised tools. It compares new posts with real examples of hate speech and neutral content, uses a BERT-based classifier to help flag harmful messages, looks up slang and informal language using sources like Urban Dictionary, generates chain-of-thought reasoning, and checks platform guidelines to explain and support its decisions. This combination allows the chatbot not only to detect hate speech but to explain why content is considered harmful, grounded in both precedent and policy. Experimental results show that our proposed method surpasses existing state-of-the-art methods, reaching a macro F1 score of 0.91. Designed for moderators, safety teams, and researchers, the tool helps reduce online harms by supporting collaboration between AI and human oversight.
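The decision logic that combines WATCHED's tools into an explained verdict might look roughly like this minimal sketch; all function names, thresholds, and data are hypothetical, and the real system delegates these steps to an LLM agent rather than hard-coded rules.

```python
# Hypothetical sketch: combine a classifier score, retrieved precedents,
# and platform guidelines into a single explained moderation decision.
from dataclasses import dataclass

@dataclass
class Decision:
    is_hate: bool
    explanation: str

def moderate(post, classifier_score, precedents, guidelines, threshold=0.5):
    """Flag a post and collect human-readable reasons for the decision."""
    similar = [p for p in precedents
               if any(w in post.lower() for w in p["keywords"])]
    flagged = classifier_score >= threshold
    reasons = []
    if flagged:
        reasons.append(f"classifier score {classifier_score:.2f} >= {threshold}")
    for p in similar:
        reasons.append(f"similar to precedent: {p['label']}")
    if flagged and guidelines:
        reasons.append(f"violates guideline: {guidelines[0]}")
    return Decision(is_hate=flagged, explanation="; ".join(reasons) or "no signals")

d = moderate(
    "example hateful post about group X",
    classifier_score=0.91,
    precedents=[{"keywords": ["group x"], "label": "hate"}],
    guidelines=["no attacks on protected groups"],
)
```

The point of the sketch is the shape of the output: a verdict grounded in both precedent and policy, which is what lets moderators audit the system's reasoning.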


Debunking with Dialogue? Exploring AI-Generated Counterspeech to Challenge Conspiracy Theories

Lisker, Mareike, Gottschalk, Christina, Mihaljević, Helena

arXiv.org Artificial Intelligence

Counterspeech is a key strategy against harmful online content, but scaling expert-driven efforts is challenging. Large Language Models (LLMs) present a potential solution, though their use in countering conspiracy theories is under-researched. Unlike for hate speech, no datasets exist that pair conspiracy theory comments with expert-crafted counterspeech. We address this gap by evaluating the ability of GPT-4o, Llama 3, and Mistral to effectively apply counterspeech strategies derived from psychological research provided through structured prompts. Our results show that the models often generate generic, repetitive, or superficial results. Additionally, they over-acknowledge fear and frequently hallucinate facts, sources, or figures, making their prompt-based use in practical applications problematic.
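A minimal sketch of the structured-prompt setup the abstract describes, with invented strategy texts standing in for the psychologically derived strategies used in the paper:

```python
# Hypothetical counterspeech strategies; the paper's actual strategy
# wording comes from psychological research and is not reproduced here.
STRATEGIES = {
    "empathy": "Acknowledge the commenter's concerns without validating false claims.",
    "fact-based": "Correct the central false claim with verifiable information.",
    "socratic": "Ask questions that expose gaps in the conspiracy narrative.",
}

def build_prompt(comment, strategy):
    """Assemble a structured prompt pairing a strategy with a target comment."""
    instruction = STRATEGIES[strategy]
    return (
        "You are writing counterspeech to a conspiracy-theory comment.\n"
        f"Strategy: {instruction}\n"
        f"Comment: {comment}\n"
        "Counterspeech:"
    )

p = build_prompt("The moon landing was staged.", "fact-based")
```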


Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix

Contreras, Juan Manuel

arXiv.org Artificial Intelligence

As large language models (LLMs) become increasingly integrated into real-world applications, scalable and rigorous safety evaluation is essential. This paper introduces Aymara AI, a programmatic platform for generating and administering customized, policy-grounded safety evaluations. Aymara AI transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments. We demonstrate its capabilities through the Aymara LLM Risk and Responsibility Matrix, which evaluates 20 commercially available LLMs across 10 real-world safety domains. Results reveal wide performance disparities, with mean safety scores ranging from 52.4% to 86.2%. While models performed well in well-established safety domains such as Misinformation (mean = 95.7%), they consistently failed in more complex or underspecified domains, notably Privacy & Impersonation (mean = 24.3%). Analyses of variance confirmed that safety scores differed significantly across both models and domains (p < .05). These findings underscore the inconsistent and context-dependent nature of LLM safety and highlight the need for scalable, customizable tools like Aymara AI to support responsible AI development and oversight.
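The model-by-domain matrix the abstract reports can be illustrated with a toy aggregation; the pass/fail scores below are made up, and the real platform scores responses with an AI-based rater rather than binary labels.

```python
# Hypothetical per-response safety outcomes (1 = safe, 0 = unsafe),
# keyed by (model, domain). All values are invented for illustration.
from statistics import mean

scores = {
    ("model_a", "misinformation"): [1, 1, 1, 0, 1],
    ("model_a", "privacy"):        [0, 0, 1, 0, 0],
    ("model_b", "misinformation"): [1, 1, 1, 1, 1],
    ("model_b", "privacy"):        [0, 1, 0, 0, 1],
}

def matrix(scores):
    """Mean safety score (%) per (model, domain) cell."""
    return {k: 100 * mean(v) for k, v in scores.items()}

cells = matrix(scores)
```

Cells like these are what the reported ANOVA compares across models and domains.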


A Modular Taxonomy for Hate Speech Definitions and Its Impact on Zero-Shot LLM Classification Performance

Melis, Matteo, Lapesa, Gabriella, Assenmacher, Dennis

arXiv.org Artificial Intelligence

Detecting harmful content is a crucial task in the landscape of NLP applications for Social Good, with hate speech being one of its most dangerous forms. But what do we mean by hate speech, how can we define it, and how does prompting different definitions of hate speech affect model performance? The contribution of this work is twofold. At the theoretical level, we address the ambiguity surrounding hate speech by collecting and analyzing existing definitions from the literature. We organize these definitions into a taxonomy of 14 Conceptual Elements: building blocks that capture different aspects of hate speech definitions, such as references to the target of hate (individuals or groups) or to its potential consequences. At the experimental level, we employ the collection of definitions in a systematic zero-shot evaluation of three LLMs, on three hate speech datasets representing different types of data (synthetic, human-in-the-loop, and real-world). We find that choosing different definitions, i.e., definitions with a different degree of specificity in terms of encoded elements, impacts model performance, but this effect is not consistent across all architectures.
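The modular-definition idea can be sketched as composing a zero-shot prompt from conceptual elements; the element texts below are invented placeholders, not the taxonomy's actual wording.

```python
# Hypothetical conceptual elements; the paper defines 14 such
# building blocks, with different wording than shown here.
ELEMENTS = {
    "target": "directed at an individual or group based on identity",
    "consequence": "likely to incite harm or discrimination",
    "intent": "intended to demean or dehumanize",
}

def definition_from(elements):
    """Compose a hate speech definition from a chosen subset of elements."""
    return ("Hate speech is content that is "
            + ", and ".join(ELEMENTS[e] for e in elements) + ".")

def zero_shot_prompt(text, elements):
    """Embed the composed definition in a zero-shot classification prompt."""
    return (definition_from(elements)
            + "\nDoes the following text meet this definition?"
            + " Answer yes or no.\nText: " + text)

prompt = zero_shot_prompt("some example post", ["target", "intent"])
```

Varying which elements go into the definition is exactly the knob whose effect on model performance the paper measures.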


A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient

Tereshchenko, Yehor, Hämäläinen, Mika

arXiv.org Artificial Intelligence

Artificial Intelligence (AI) and Large Language Models (LLMs) have rapidly evolved in recent years, showcasing remarkable capabilities in natural language understanding and generation. However, these advancements also raise critical ethical questions regarding safety, potential misuse, discrimination, and overall societal impact. This article provides a comparative analysis of the ethical performance of various AI models, including the brand-new DeepSeek-V3 (R1 with reasoning, and without), various GPT variants (4o, 3.5 Turbo, 4 Turbo, o1/o3 mini), and Gemini (1.5 Flash, 2.0 Flash, and 2.0 Flash Exp), and highlights the need for robust human oversight, especially in situations with high stakes. Furthermore, we present a new metric for calculating harm in LLMs called Relative Danger Coefficient (RDC).


LLMs for Translation: Historical, Low-Resourced Languages and Contemporary AI Models

Tekgurler, Merve

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable adaptability in performing various tasks, including machine translation (MT), without explicit training. Models such as OpenAI's GPT-4 and Google's Gemini are frequently evaluated on translation benchmarks and utilized as translation tools due to their high performance. This paper examines Gemini's performance in translating an 18th-century Ottoman Turkish manuscript, Prisoner of the Infidels: The Memoirs of Osman Agha of Timisoara, into English. The manuscript recounts the experiences of Osman Agha, an Ottoman subject who spent 11 years as a prisoner of war in Austria, and includes his accounts of warfare and violence. Our analysis reveals that Gemini's safety mechanisms flagged between 14 and 23 percent of the manuscript as harmful, resulting in untranslated passages. These safety settings, while effective in mitigating potential harm, hinder the model's ability to provide complete and accurate translations of historical texts. Through real historical examples, this study highlights the inherent challenges and limitations of current LLM safety implementations in the handling of sensitive and context-rich materials. These real-world instances underscore potential failures of LLMs in contemporary translation scenarios where accurate and comprehensive translations are crucial, for example, translating the accounts of modern victims of war for legal proceedings or humanitarian documentation.


Hate Speech and Sentiment of YouTube Video Comments From Public and Private Sources Covering the Israel-Palestine Conflict

Hofmann, Simon, Sommermann, Christoph, Kraus, Mathias, Zschech, Patrick, Rosenberger, Julian

arXiv.org Artificial Intelligence

This study explores the prevalence of hate speech (HS) and sentiment in YouTube video comments concerning the Israel-Palestine conflict by analyzing content from both public and private news sources. The research involved annotating 4983 comments for HS and sentiments (neutral, pro-Israel, and pro-Palestine). Subsequently, machine learning (ML) models were developed, demonstrating robust predictive capabilities with area under the receiver operating characteristic (AUROC) scores ranging from 0.83 to 0.90. These models were applied to the extracted comment sections of YouTube videos from public and private sources, uncovering a higher incidence of HS in public sources (40.4%) compared to private sources (31.6%). Sentiment analysis revealed a predominantly neutral stance in both source types, with more pronounced sentiments towards Israel and Palestine observed in public sources. This investigation highlights the dynamic nature of online discourse surrounding the Israel-Palestine conflict and underscores the importance of content moderation in politically charged environments.
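AUROC, the metric reported above, can be computed directly from labels and scores via its rank-based (Mann-Whitney) formulation; the toy data below is not from the study.

```python
# AUROC as the probability that a randomly chosen positive example
# is scored above a randomly chosen negative one (ties count half).
def auroc(labels, scores):
    pairs = 0
    wins = 0.0
    for li, si in zip(labels, scores):
        for lj, sj in zip(labels, scores):
            if li == 1 and lj == 0:
                pairs += 1
                if si > sj:
                    wins += 1
                elif si == sj:
                    wins += 0.5
    return wins / pairs

# Toy example: two positives ranked above two negatives -> perfect 1.0.
a = auroc([1, 1, 0, 0], [0.9, 0.4, 0.3, 0.2])
```

An AUROC of 0.83-0.90, as the study reports, means a randomly chosen HS comment outranks a randomly chosen non-HS comment 83-90% of the time.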


AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages

Muhammad, Shamsuddeen Hassan, Abdulmumin, Idris, Ayele, Abinew Ali, Adelani, David Ifeoluwa, Ahmad, Ibrahim Said, Aliyu, Saminu Mohammad, Onyango, Nelson Odhiambo, Wanzare, Lilian D. A., Rutunda, Samuel, Aliyu, Lukman Jibril, Alemneh, Esubalew, Hourrane, Oumaima, Gebremichael, Hagos Tesfahun, Ismail, Elyas Abdi, Beloucif, Meriem, Jibril, Ebrahim Chekol, Bukula, Andiswa, Mabuya, Rooweither, Osei, Salomey, Oppong, Abigail, Belay, Tadesse Destaw, Guge, Tadesse Kebede, Asfaw, Tesfa Tegegne, Chukwuneke, Chiamaka Ijeoma, Röttger, Paul, Yimam, Seid Muhie, Ousidhoum, Nedjma

arXiv.org Artificial Intelligence

Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. Each instance in AfriHate is annotated by native speakers familiar with the local culture. We report the challenges related to the construction of the datasets and present various classification baseline results with and without using LLMs. The datasets, individual annotations, and hate speech and offensive language lexicons are available on https://github.com/AfriHate/AfriHate


Digital Guardians: Can GPT-4, Perspective API, and Moderation API reliably detect hate speech in reader comments of German online newspapers?

Weber, Manuel, Huber, Moritz, Auch, Maximilian, Döschl, Alexander, Keller, Max-Emanuel, Mandl, Peter

arXiv.org Artificial Intelligence

In recent years, toxic content and hate speech have become widespread phenomena on the internet. Moderators of online newspapers and forums are now required, partly due to legal regulations, to carefully review and, if necessary, delete reader comments. This is a labor-intensive process. Some providers of large language models already offer solutions for automated hate speech detection or the identification of toxic content. These include GPT-4o from OpenAI, Jigsaw's (Google) Perspective API, and OpenAI's Moderation API. Based on the selected German test dataset HOCON34k, which was specifically created for developing tools to detect hate speech in reader comments of online newspapers, these solutions are compared with each other and against the HOCON34k baseline. The test dataset contains 1,592 annotated text samples. For GPT-4o, three different promptings are used, employing a Zero-Shot, One-Shot, and Few-Shot approach. The results of the experiments demonstrate that GPT-4o outperforms both the Perspective API and the Moderation API, and exceeds the HOCON34k baseline by approximately 5 percentage points, as measured by a combined metric of MCC and F2-score.